Analysis of Natural and Human-Made Disasters Worldwide

By Taha Al-Nufaili



Importance


Natural disasters have occurred throughout known human history, and more recently, human populations have begun to suffer from human-made disasters in addition to the natural ones. We want to know how these disasters are affecting humanity, and whether their risk is increasing or diminishing.

Purpose and Goal


This tutorial aims to raise awareness of the effects of disasters by exploring how natural and human-made disasters are affecting humanity. The tutorial will focus on exploring whether the number of occurrences corresponds to the rate of fatalities. It will also attempt to find correlations between a country's income and the number of disasters and disaster-caused fatalities.

Data Collection


At this phase, we aim to find relevant data that can give us a better understanding of disasters and economic development worldwide. The dataset we are looking for should describe the number of deaths caused by human-made and natural disasters. We are also looking for a dataset that can show us the development of economies worldwide. Additionally, both datasets should have some indication of time.
Furthermore, for more accurate and trustworthy results, it is important to base our tutorial on data produced by trustworthy sources.

In this tutorial we will use the following datasets:

Before we do anything, let's import all the libraries that we need for our tutorial...

Libraries save us time and effort. They let us use other people's code, which simplifies data manipulation, visualization, and countless other tasks.
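The exact import list is an assumption (the original code cells are not shown); a typical setup for a tutorial like this, with pandas, NumPy, and matplotlib up front, might look like:

```python
import pandas as pd              # dataframes and data wrangling
import numpy as np               # numerical helpers (NaN, log, ...)
import matplotlib.pyplot as plt  # static plots
# import plotly.express as px            # interactive maps (used later)
# import statsmodels.formula.api as smf  # OLS regression (used later)
```

plotly and statsmodels are commented out here because they only appear in later phases.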

Data Processing


This is the second phase of the data science pipeline. In this phase, I will go through the steps of preparing the data to be used in later phases of the data science pipeline. That includes manipulating our tables so that they are tidy and only contain the information relevant to our purposes. It also includes dealing with missing data and making sure that our dataframe stores the data with the right types. For example, for better results, dates should be stored as Datetime objects instead of plain strings.

Read and store data as pandas dataframe objects

Pandas is a library built on Python. It is very popular in the world of data science because of its extensive capabilities and wide range of uses. I will use pandas to store our datasets in dataframes, which are objects that are easy to manipulate with pandas' many pre-built functions.
I encourage you to learn more about pandas if you are not familiar with it. It will make your life MUCH easier and more efficient!
Here is the pandas documentation to help you get started.
NOTE: run the following code to read the files only once! Our datasets are very large, and reading the files can consume computation power and time.
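Since the actual dataset files are not bundled here, this sketch simulates a tiny WDI-style CSV in memory with io.StringIO; in practice you would pass the real file paths to pd.read_csv:

```python
import io
import pandas as pd

# In the real notebook this would be, e.g.,
#   gdp_df = pd.read_csv("world_development_indicators.csv")
#   disas_df = pd.read_csv("disasters.csv")
# (file names are hypothetical). Here we simulate a tiny CSV instead.
csv_text = """Country Name,Country Code,Series Name,2019,2020
Norway,NOR,GDP (current US$),4.0e11,3.6e11
"""
gdp_df = pd.read_csv(io.StringIO(csv_text))
print(gdp_df.shape)
```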

Now that we have the datasets as Pandas dataframes, we can start manipulating and beautifying them to prepare them for our purposes in later phases of the data science pipeline.

Let's start with: World Development Indicators (gdp_df)

Drop unwanted rows

Drop invalid rows

Drop GDP growth per capita rows

Drop Series Code column
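The three drop steps above can be sketched on a toy frame whose column names mimic the WDI layout (the exact column names and series labels are assumptions):

```python
import pandas as pd

# Toy frame standing in for the real gdp_df.
gdp_df = pd.DataFrame({
    "Series Name": ["GDP (current US$)",
                    "GDP per capita growth (annual %)",
                    None],                      # an invalid row
    "Series Code": ["NY.GDP.MKTP.CD", "NY.GDP.PCAP.KD.ZG", None],
    "Country Name": ["Norway", "Norway", None],
})

gdp_df = gdp_df.dropna(subset=["Series Name"])   # drop invalid rows
gdp_df = gdp_df[gdp_df["Series Name"]            # drop GDP growth rows
                != "GDP per capita growth (annual %)"]
gdp_df = gdp_df.drop(columns=["Series Code"])    # drop Series Code column
```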

Melting the table


Tidying the table is very important for easier reading and plotting of the data. In a tidy table, each variable has its own column, each row represents an observation, and each cell contains a single value. Learn more about what Tidy Data is.

Since each date is a separate column name in gdp_df, we need to melt the dataframe to make our data tidy. In this step, we will simply convert the columns that represent years into one column called "Date", so our table will become much more vertical. Luckily, pandas provides a simple and easy way to do this: the pandas.melt() function.

To use this function, specify the column names that you want left unchanged in the id_vars parameter. You can then specify the names of the two new columns with the var_name and value_name parameters.
Feel free to learn more about melting with pandas.



NOTE: run the following code only once! The pandas.melt() function greatly increases the vertical length of the dataframe. Running the code more than once makes the results either contain duplicates or not make sense.
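Here is what the melt might look like on a tiny WDI-style frame (column names are assumptions):

```python
import pandas as pd

gdp_df = pd.DataFrame({
    "Country Name": ["Norway"],
    "Country Code": ["NOR"],
    "Series Name": ["GDP (current US$)"],
    "2019": [4.0e11],
    "2020": [3.6e11],
})

gdp_df = pd.melt(
    gdp_df,
    id_vars=["Country Name", "Country Code", "Series Name"],  # unchanged cols
    var_name="Date",      # former year-column names land here
    value_name="Value",   # the cell values land here
)
```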

Each column is a variable


Now that we have our data melted, we need to change the table so that each column represents a variable. In this table, we can see that GDP per capita and GDP should each have their own column. So, we want to delete the Series Name column and add two columns, "GDP per capita" and "GDP". To accomplish this, once again, the pandas library comes to the rescue!
For this step, we will use the pandas.set_index() function. It lets you set columns as indexes so that you can unstack the desired index (Series Name in our case) using the unstack() function.
After unstacking, the dataframe will have 0 columns but 5 indexes because of the set_index() function. So we need to reset our indexes using the .reset_index() function. After that, we will have 5 columns, but now we want to rename some of the columns that have verbose names; for example, we can rename "GDP (current US$)" to "GDP".
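The set_index / unstack / reset_index / rename chain described above, sketched on a toy melted frame (the series labels are assumptions):

```python
import pandas as pd

melted = pd.DataFrame({
    "Country Name": ["Norway", "Norway"],
    "Country Code": ["NOR", "NOR"],
    "Date": ["2020", "2020"],
    "Series Name": ["GDP (current US$)", "GDP per capita (current US$)"],
    "Value": [3.6e11, 6.7e4],
})

wide = (
    melted.set_index(["Country Name", "Country Code", "Date", "Series Name"])
          .unstack("Series Name")        # one column per series
)
wide.columns = wide.columns.droplevel(0)  # drop the leftover "Value" level
wide = wide.reset_index().rename(columns={
    "GDP (current US$)": "GDP",
    "GDP per capita (current US$)": "GDP per capita",
})
```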

Hurray!

We are almost there!


Now the gdp_df table looks much cleaner and nicer. Each row represents an observation and each column represents a variable. However, we are still not done. We need to make sure the values are formatted correctly and stored with the correct types!

Convert Date column entries to Datetime objects


Datetime is a Python library that makes graphing, plotting, and processing dates easier. The Date column entries are currently stored as strings in our table. We will convert the entries of the Date column to Datetime objects.
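A minimal sketch of the conversion, assuming the Date column holds four-digit year strings:

```python
import pandas as pd

gdp_df = pd.DataFrame({"Date": ["1970", "2020"]})  # toy stand-in

# Parse the year strings into proper datetime objects.
gdp_df["Date"] = pd.to_datetime(gdp_df["Date"], format="%Y")
```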

Convert numerical values into ints and empty values into NaN objects


Since missing values are represented as ".." and numerical values are stored as strings, we have to process the GDP and GDP Per Capita columns. We can deal with NaN objects (aka missing values) later.
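One simple way to handle both problems at once is pd.to_numeric with errors="coerce", which parses the numeric strings and turns ".." into NaN; a sketch on toy data:

```python
import pandas as pd

gdp_df = pd.DataFrame({
    "GDP": ["1000", ".."],
    "GDP per capita": ["50", ".."],
})

for col in ["GDP", "GDP per capita"]:
    # Numeric strings become floats; ".." cannot be parsed, so it becomes NaN.
    gdp_df[col] = pd.to_numeric(gdp_df[col], errors="coerce")
```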

Only include years that are greater than or equal 1970

Just to make our findings a little more relevant, let's remove rows for events that happened before 1970.
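Once the Date column holds Datetime objects, the filter is a one-liner; a sketch:

```python
import pandas as pd

gdp_df = pd.DataFrame({
    "Date": pd.to_datetime(["1965", "1970", "1990"], format="%Y"),
})

# Keep only observations from 1970 onwards.
gdp_df = gdp_df[gdp_df["Date"].dt.year >= 1970]
```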

yess!

We are now done processing the first dataframe!!


The table is now tidy and the data types are as we want them!
Now we can move forward to process the next dataframe... (disas_df)

Drop unwanted columns

Drop unwanted rows


Since the first dataset records information from 1970 to 2020, we will drop rows with years before 1970 or after 2020. The rows with subgroup Extra-terrestrial are also not important. Additionally, in this tutorial, we will drop the rows with missing deaths entries.
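A sketch of the three drop steps on a toy disasters frame (the column names follow the EM-DAT-style layout assumed by this tutorial):

```python
import pandas as pd

disas_df = pd.DataFrame({
    "Start Year": [1960, 1980, 1985, 2000],
    "Disaster Subgroup": ["Geophysical", "Extra-terrestrial",
                          "Geophysical", "Hydrological"],
    "Total Deaths": [100.0, 5.0, 50.0, None],
})

# Keep only 1970-2020.
disas_df = disas_df[(disas_df["Start Year"] >= 1970)
                    & (disas_df["Start Year"] <= 2020)]
# Drop the Extra-terrestrial subgroup.
disas_df = disas_df[disas_df["Disaster Subgroup"] != "Extra-terrestrial"]
# Drop rows with missing deaths.
disas_df = disas_df.dropna(subset=["Total Deaths"])
```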

Process the Dates


Here, we will combine the Start Year, Start Month, and Start Day columns into a Date column. Once again, we will use the Datetime library for the dates.
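pd.to_datetime can assemble dates from a frame with year/month/day columns, so one way to combine the three columns (the rename only maps the tutorial's column names onto the names to_datetime expects) is:

```python
import pandas as pd

disas_df = pd.DataFrame({
    "Start Year": [2001],
    "Start Month": [3],
    "Start Day": [15],
})

# Assemble a single Date column from the three parts.
disas_df["Date"] = pd.to_datetime(
    disas_df.rename(columns={"Start Year": "year",
                             "Start Month": "month",
                             "Start Day": "day"})[["year", "month", "day"]]
)
```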

Drop unneeded columns


Since we converted the dates into Datetime objects in the Date column, we no longer need the Start Year, Start Month, and Start Day columns.


Adjust the types of Latitude and Longitude


Since the table stores these two columns' entries as strings, we should convert the values to floats. Also, missing values should be stored as NaN objects, not strings.

Adjust the missing values of Disaster Subtype column


Missing values are stored as the string "NaN", but we should convert them to actual NaN objects.
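A sketch of the replacement, using NumPy's NaN:

```python
import numpy as np
import pandas as pd

disas_df = pd.DataFrame({"Disaster Subtype": ["Flash flood", "NaN"]})

# Turn the literal string "NaN" into a real missing value.
disas_df["Disaster Subtype"] = disas_df["Disaster Subtype"].replace("NaN", np.nan)
```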

Make sure the entries of the ISO column are stored as strings


We will use this column for merging and joining the dataframes. The merge and join functions require the key columns to have consistent native types, such as strings or ints. More on that in the next phase.

yess!

Now that our tables look clean and nice, we can proceed to the next phase...

Exploratory Analysis & Data Visualization


In the third phase of the data science pipeline, we explore our data. This is usually done by visualizing the data in many different ways. The goal here is to find interesting trends and relationships in the data that we have. We will produce many plots and graphs to hopefully discover some interesting information.

Let's plot the number of deaths over time

Can't see!

We have outliers!


In this situation, we could trim 1%, 3%, or even 10% of our data based on Total Deaths values. However, we can also change the scale of our plot so that we can see all the values clearly. That is where the log function comes in!


In this tutorial, we will demonstrate both approaches for you :)

Let's plot again with a log scale...
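A minimal matplotlib sketch of the log-scale approach, on made-up values spanning several orders of magnitude:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

disas_df = pd.DataFrame({
    "Date": pd.to_datetime(["1975", "1990", "2005"], format="%Y"),
    "Total Deaths": [10, 1_000, 300_000],  # spans several orders of magnitude
})

fig, ax = plt.subplots()
ax.scatter(disas_df["Date"], disas_df["Total Deaths"])
ax.set_yscale("log")  # log scale makes small and huge values both visible
ax.set_ylabel("Total Deaths (log scale)")
```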

Better!

For now, let's compare the number of disasters by Group

As expected, it seems that Technological disasters happen less frequently than natural disasters.

Let's use a pie plot to further explore how the disasters are distributed

Hmmm, interesting. Let's see what the pie looks like if we base our plot on deaths instead of occurrences.

Interestingly, we can see that although Technological disasters make up 44% of occurrences, they are responsible for only 18% of deaths. Also, Climatological disasters seem very deadly.

It appears that our data has some outliers that make it too skewed. To deal with outliers, we can trim the lower and upper 1% of each disaster subgroup.

Dropping Outliers by Trimming 1%

We can see that the most extreme of the outliers were removed

Deaths by Subgroup


Here, we want to see how much each subgroup is responsible for deaths. Luckily, pandas provides the groupby() function, which groups our data based on a category we specify, "Disaster Subgroup" in this case. We can then use the .sum() function to sum the entries grouped by Disaster Subgroup. So, we will have the following:
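A sketch of the groupby-and-sum on toy data:

```python
import pandas as pd

disas_df = pd.DataFrame({
    "Disaster Subgroup": ["Geophysical", "Geophysical", "Hydrological"],
    "Total Deaths": [100, 50, 30],
})

# Total deaths per subgroup.
deaths_by_subgroup = disas_df.groupby("Disaster Subgroup")["Total Deaths"].sum()
```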

Interestingly, we can see that the share of deaths attributed to technological disasters increased noticeably after trimming

What about deaths by Disaster Subgroup over time


To find out which subgroup is contributing to the fast increase between 1980 and 2010, we can break down our data by subgroup.

To plot cumulative deaths by subgroup, we need to sum each subgroup individually, which is what the following does...
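One way to get per-subgroup cumulative sums (a sketch, using groupby().cumsum() after sorting by date):

```python
import pandas as pd

disas_df = pd.DataFrame({
    "Disaster Subgroup": ["Geophysical", "Hydrological", "Geophysical"],
    "Date": pd.to_datetime(["1980", "1985", "1990"], format="%Y"),
    "Total Deaths": [100, 30, 50],
})

# Cumulative sums must follow time order within each subgroup.
disas_df = disas_df.sort_values("Date")
disas_df["Cumulative Deaths"] = (
    disas_df.groupby("Disaster Subgroup")["Total Deaths"].cumsum()
)
```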

We can see from the graphs above that technological disasters have skyrocketed in recent years, and they are mostly responsible for the jump in deaths from the 1980s until around 2010; the number of deaths from technological disasters grew exponentially in that period.

Now, let's see what the breakdown of deaths by Disaster Type looks like

We can see that drought, wildfire, volcanic activity, and mass movement disasters are not that frequent and do not contribute many deaths. In this case, we can combine them under the label "Other Natural", since they are all natural disasters.

Disasters by Continent

Just like before, we will use the pandas groupby() function. It lets us easily sum, count, or perform many other operations on our dataframe. Learn more about pandas groupby(). To get the names of the grouped column, we can run df.index.array.

Interestingly, the deaths and occurrences plots look very similar. I think that is because we trimmed the outliers, so the average disaster causes a similar number of casualties across continents.

To understand our data more, let's plot the disasters on a map!

To do this, we will use Plotly, a scientific graphing library. It allows us to create awesome graphs very easily!
Additionally, one of my favorite Plotly features is its continuous color bar.
Learn more about plotly.

We can also plot multiple graphs next to each other with the facet_col argument...

Let's see how recent those disasters are...

Looking at the graphs, it seems that most of the data with coordinates is recent. This is no coincidence, since measuring coordinates requires technological advancement that is more common nowadays.

Now, let's see how our other table looks

Drop gdp_df rows that have countries not in disas_df

Now that we have explored our data, we can start the next phase of the data science pipeline

Hypothesis Testing and Machine Learning

In this stage, we will go through some topics related to machine learning and model creation. This section will teach you ways to predict information based on predefined variables.

We will cover two main topics: Ordinary Least Squares and regression analysis.

Before we start, it is better to have column names without white spaces. That is because some of the functions we will use next assume the column names contain no white spaces.

Also, let's add the following column to our disas_df dataframe: GDP_PC
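A sketch of both steps on toy data: replacing spaces in column names with underscores (the statsmodels formula API chokes on spaces), then merging a GDP_PC column in from the GDP frame. The merge keys here (ISO and Year) are assumptions about how the two tables line up:

```python
import pandas as pd

# Toy frames standing in for the ones built in earlier phases.
disas_df = pd.DataFrame({
    "Total Deaths": [10, 200],
    "ISO": ["NOR", "IDN"],
    "Year": [1990, 1990],
})
gdp_df = pd.DataFrame({
    "Country Code": ["NOR", "IDN"],
    "Year": [1990, 1990],
    "GDP per capita": [28_000.0, 600.0],
})

# Replace spaces so formulas like "Total_Deaths ~ GDP_PC" parse.
disas_df.columns = [c.replace(" ", "_") for c in disas_df.columns]

# Pull GDP per capita into the disasters frame as GDP_PC.
disas_df = disas_df.merge(
    gdp_df.rename(columns={"Country Code": "ISO",
                           "GDP per capita": "GDP_PC"}),
    on=["ISO", "Year"], how="left",
)
```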

Hypothesis Testing and Prediction with OLS


Let's say we have a null hypothesis stating that there is no relationship between the number of deaths and the GDP per capita of the country where the disaster happens. To figure out whether this hypothesis can be rejected, we want to fit our data to a linear regression model. This model will also let us predict the total deaths. There are many ways to approach this, but we can use the ordinary least squares method to build a linear regression model.

Deaths = (alpha * gdp_per_capita) + Intercept


Let's assume that, given the gdp_per_capita, the number of deaths can be calculated with some unknown coefficient alpha. So, we are guessing that the number of deaths is linearly correlated with the gdp_per_capita.


To find out what this coefficient is, we can use the statsmodels library.


Also, regarding NaN entries: in this tutorial, we simply drop them when building our prediction models.
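A sketch of the OLS fit with statsmodels' formula API. Since the real disas_df is not available here, the data is synthetic, generated with a known negative slope so the fit has something to recover:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: deaths decrease with GDP per capita, plus noise.
rng = np.random.default_rng(0)
gdp_pc = rng.uniform(500, 50_000, size=200)
deaths = 2_000 - 0.02 * gdp_pc + rng.normal(0, 100, size=200)
df = pd.DataFrame({"Total_Deaths": deaths, "GDP_PC": gdp_pc}).dropna()

# Deaths = alpha * GDP_PC + Intercept, fit by ordinary least squares.
model = smf.ols("Total_Deaths ~ GDP_PC", data=df).fit()
print(model.summary())  # p-values, F-statistic, coefficients, ...
```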

Single Parameter Graph


Let's see how our model performs...

To see if there is a correlation between deaths and GDP, we can also look at the statistics of our model.

To see a summary of the important statistics of our model, we can simply use the summary() function!

We can see that all p-values are less than 0.05, so we can reject the null hypothesis. The coefficient of GDP per capita is around -0.0011. This makes sense, because the poorer a country is, the more likely it is to have less strict safety rules for disasters.

Our model works, but it is still not giving us good predictions (as seen in the graph). Also, since the F-value is small, we can say that our model is not yet trustworthy for predicting the number of deaths.

More Parameters!!!

When we add more parameters to our model, we expect it to work better and more precisely. So, let's try adding Year as a second parameter and see what happens...

Deaths = alpha * (GDP_PC) + beta * Year + Intercept

This time, we will try to find the coefficients alpha and beta that best fit our model.
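The two-parameter fit looks the same with an extra term in the formula; again sketched on synthetic data with known coefficients, since the real disas_df is not available here:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data where deaths depend on both GDP per capita and year.
rng = np.random.default_rng(1)
n = 200
gdp_pc = rng.uniform(500, 50_000, size=n)
year = rng.integers(1970, 2021, size=n)
deaths = 5_000 - 0.02 * gdp_pc + 3.0 * year + rng.normal(0, 100, size=n)
df = pd.DataFrame({"Total_Deaths": deaths, "GDP_PC": gdp_pc, "Year": year})

# Deaths = alpha * GDP_PC + beta * Year + Intercept.
model = smf.ols("Total_Deaths ~ GDP_PC + Year", data=df).fit()
print(model.fvalue)  # overall F-statistic of the two-parameter fit
```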

Better! We managed to increase the F-value of our model by including Year as a second parameter! The greater the F-value, the better the model. You can read more about F-values here.

Also, feel free to try out different parameters to see if there is a correlation between them and the number of deaths.

Interpretation: Insight & Policy Decision

In this stage, which is usually the last one, we try to reach conclusions about our findings.

By plotting our disaster and GDP dataframes, we found out many interesting things:

In this tutorial, we definitely did not cover everything we could learn about our dataset; there is simply a lot to uncover! As you saw, there are many ways to visualize data, and it can sometimes be very hard, especially with multidimensional data. However, what helped us understand our data was tidying our datasets so that it became easier to extract information from them. Also, knowing the best ways to visualize data is very important, as we saw in the visualization stage, where we used a log scale on the y-axis of a scatter plot.

There are so many things to uncover and countless possibilities: we can discover unexpected relationships between different variables in our data, as we saw with the number of deaths and GDP per capita.

The mic is yours now!
